DSPatch: Dual Spatial Pattern Prefetcher
High main memory latency continues to limit performance of modern
high-performance out-of-order cores. While DRAM latency has remained nearly the
same over many generations, DRAM bandwidth has grown significantly due to
higher frequencies, newer architectures (DDR4, LPDDR4, GDDR5) and 3D-stacked
memory packaging (HBM). Current state-of-the-art prefetchers do not do well in
extracting higher performance when higher DRAM bandwidth is available.
Prefetchers need the ability to dynamically adapt to available bandwidth,
boosting prefetch count and prefetch coverage when headroom exists and
throttling down to achieve high accuracy when the bandwidth utilization is
close to peak. To this end, we present the Dual Spatial Pattern Prefetcher
(DSPatch) that can be used as a standalone prefetcher or as a lightweight
adjunct spatial prefetcher to the state-of-the-art delta-based Signature
Pattern Prefetcher (SPP). DSPatch builds on a novel and intuitive use of
modulated spatial bit-patterns. The key idea is to: (1) represent program
accesses on a physical page as a bit-pattern anchored to the first "trigger"
access, (2) learn two spatial access bit-patterns: one biased towards coverage
and another biased towards accuracy, and (3) select one bit-pattern at run-time
based on the DRAM bandwidth utilization to generate prefetches. Across a
diverse set of workloads, using only 3.6KB of storage, DSPatch improves
performance over an aggressive baseline with a PC-based stride prefetcher at
the L1 cache and the SPP prefetcher at the L2 cache by 6% (9% in
memory-intensive workloads and up to 26%). Moreover, the performance of
DSPatch+SPP scales with increasing DRAM bandwidth, growing from 6% over SPP to
10% when DRAM bandwidth is doubled.
Comment: This work is to appear in MICRO 201
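The three-step idea in the DSPatch abstract above (anchor a page's accesses to the trigger, learn a coverage-biased and an accuracy-biased bit-pattern, pick one by bandwidth utilization) can be sketched roughly as follows. This is an illustrative sketch, not the authors' design: the 64-line page, the use of bitwise OR for the coverage-biased pattern and bitwise AND for the accuracy-biased pattern, the `high_watermark` threshold, and all names are assumptions made for the example.

```python
LINES_PER_PAGE = 64  # assumption: 4 KB pages, 64 B cache lines
PAGE_MASK = (1 << LINES_PER_PAGE) - 1


def anchor(pattern: int, trigger: int) -> int:
    """Rotate a page bit-pattern so the trigger line lands on bit 0."""
    return ((pattern >> trigger) | (pattern << (LINES_PER_PAGE - trigger))) & PAGE_MASK


def unanchor(pattern: int, trigger: int) -> int:
    """Inverse of anchor(): rotate bit 0 back onto the trigger line."""
    return ((pattern << trigger) | (pattern >> (LINES_PER_PAGE - trigger))) & PAGE_MASK


class DualPattern:
    """Hypothetical per-signature entry holding the two learned patterns."""

    def __init__(self) -> None:
        self.coverage = 0      # assumed: OR of all observed anchored patterns
        self.accuracy = None   # assumed: AND of all observed anchored patterns

    def learn(self, lines_touched, trigger: int) -> None:
        observed = 0
        for line in lines_touched:
            observed |= 1 << line
        observed = anchor(observed, trigger)
        self.coverage |= observed
        self.accuracy = observed if self.accuracy is None else self.accuracy & observed

    def select(self, bw_utilization: float, high_watermark: float = 0.75) -> int:
        # Near peak bandwidth, prefer the accuracy-biased pattern.
        if self.accuracy is not None and bw_utilization >= high_watermark:
            return self.accuracy
        return self.coverage

    def prefetch_lines(self, trigger: int, bw_utilization: float):
        pattern = unanchor(self.select(bw_utilization), trigger)
        return [i for i in range(LINES_PER_PAGE) if pattern >> i & 1 and i != trigger]
```

For example, after learning two page visits with `learn([2, 3, 5], 2)` and `learn([10, 11, 12], 10)`, a new trigger at line 20 yields more prefetch candidates at low bandwidth utilization than at high utilization, which is the adaptive behaviour the abstract describes.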
Bridging the Gap between Cosmic Dawn and Reionization favors Faint Galaxies-dominated Models
It has been claimed that traditional models struggle to explain the tentative
detection of the 21\,cm absorption trough centered at measured by the
EDGES collaboration. On the other hand, it has been shown that the EDGES
results are consistent with an extrapolation of a declining UV luminosity
density, following a simple power-law of deep Hubble Space Telescope
observations of galaxies. We here explore the conditions by which
the EDGES detection is consistent with current reionization and
post-reionization observations, including the neutral hydrogen fraction at
--, Thomson scattering optical depth, and ionizing emissivity at
. By coupling a physically motivated source model derived from
radiative transfer hydrodynamic simulations of reionization to a Markov Chain
Monte Carlo sampler, we find that it is entirely possible to reconcile the
existing high-redshift (cosmic dawn) and low-redshift (reionization)
constraints. In particular, we find that a high contribution from low-mass halos
and high photon escape fractions are required to simultaneously
reproduce cosmic dawn and reionization constraints. Our analysis further
confirms that low-mass galaxies produce a flatter emissivity evolution, which
leads to an earlier onset of reionization with gradual and longer duration,
resulting in a higher optical depth. While our faint-galaxies dominated models
successfully reproduce the measured globally averaged quantities over the first
one billion years, they underestimate the late redshift-instantaneous
measurements in efficiently star-forming and massive systems. We show that our
(simple) physically-motivated semi-analytical prescription produces consistent
results with the (sophisticated) state-of-the-art THESAN
radiation-magneto-hydrodynamic simulation of reionization.
Comment: 14 pages, 6 figures. Accepted for publication in ApJ. Comments are
welcome.
Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction
Long-latency load requests continue to limit the performance of
high-performance processors. To increase the latency tolerance of a processor,
architects have primarily relied on two key techniques: sophisticated data
prefetchers and large on-chip caches. In this work, we show that: 1) even a
sophisticated state-of-the-art prefetcher can only predict half of the off-chip
load requests on average across a wide range of workloads, and 2) due to the
increasing size and complexity of on-chip caches, a large fraction of the
latency of an off-chip load request is spent accessing the on-chip cache
hierarchy. The goal of this work is to accelerate off-chip load requests by
removing the on-chip cache access latency from their critical path. To this
end, we propose a new technique called Hermes, whose key idea is to: 1)
accurately predict which load requests might go off-chip, and 2) speculatively
fetch the data required by the predicted off-chip loads directly from the main
memory, while also concurrently accessing the cache hierarchy for such loads.
To enable Hermes, we develop a new lightweight, perceptron-based off-chip load
prediction technique that learns to identify off-chip load requests using
multiple program features (e.g., sequence of program counters). For every load
request, the predictor observes a set of program features to predict whether or
not the load would go off-chip. If the load is predicted to go off-chip, Hermes
issues a speculative request directly to the memory controller once the load's
physical address is generated. If the prediction is correct, the load
eventually misses the cache hierarchy and waits for the ongoing speculative
request to finish, thus hiding the on-chip cache hierarchy access latency from
the critical path of the off-chip load. Our evaluation shows that Hermes
significantly improves performance of a state-of-the-art baseline. We
open-source Hermes.
Comment: To appear in 55th IEEE/ACM International Symposium on
Microarchitecture (MICRO), 202
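The perceptron-based off-chip prediction described in the Hermes abstract above can be illustrated with a small sketch. It is a generic hashed-perceptron predictor under stated assumptions, not the paper's implementation: the table size, activation threshold, training margin, weight range, and the choice of features are all hypothetical.

```python
class OffChipPredictor:
    """Hypothetical hashed perceptron: one weight table per program feature."""

    def __init__(self, num_features, table_size=1024, activation=2,
                 train_margin=8, w_min=-16, w_max=15):
        self.tables = [[0] * table_size for _ in range(num_features)]
        self.table_size = table_size
        self.activation = activation      # assumed: sum >= activation means "off-chip"
        self.train_margin = train_margin  # assumed: keep training while |sum| is small
        self.w_min, self.w_max = w_min, w_max

    def _slots(self, features):
        # Each feature value indexes its own table (simple modulo hash).
        return [(t, f % self.table_size) for t, f in zip(self.tables, features)]

    def predict(self, features):
        total = sum(t[i] for t, i in self._slots(features))
        return total >= self.activation, total

    def train(self, features, went_off_chip):
        predicted, total = self.predict(features)
        # Skip the update only when the prediction was correct and confident.
        if predicted == went_off_chip and abs(total) >= self.train_margin:
            return
        delta = 1 if went_off_chip else -1
        for t, i in self._slots(features):
            t[i] = max(self.w_min, min(self.w_max, t[i] + delta))
```

In a Hermes-like flow, a `True` prediction would trigger a speculative request to the memory controller while the normal cache lookup proceeds, and `train` would be called once the load's true outcome is known.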
Victima: Drastically Increasing Address Translation Reach by Leveraging Underutilized Cache Resources
Address translation is a performance bottleneck in data-intensive workloads
due to large datasets and irregular access patterns that lead to frequent
high-latency page table walks (PTWs). PTWs can be reduced by using (i) large
hardware TLBs or (ii) large software-managed TLBs. Unfortunately, both
solutions have significant drawbacks: increased access latency, power and area
(for hardware TLBs), and costly memory accesses, the need for large contiguous
memory blocks, and complex OS modifications (for software-managed TLBs). We
present Victima, a new software-transparent mechanism that drastically
increases the translation reach of the processor by leveraging the
underutilized resources of the cache hierarchy. The key idea of Victima is to
repurpose L2 cache blocks to store clusters of TLB entries, thereby providing
an additional low-latency and high-capacity component that backs up the
last-level TLB and thus reduces PTWs. Victima has two main components. First, a
PTW cost predictor (PTW-CP) identifies costly-to-translate addresses based on
the frequency and cost of the PTWs they lead to. Second, a TLB-aware cache
replacement policy prioritizes keeping TLB entries in the cache hierarchy by
considering (i) the translation pressure (e.g., last-level TLB miss rate) and
(ii) the reuse characteristics of the TLB entries. Our evaluation results show
that in native (virtualized) execution environments Victima improves average
end-to-end application performance by 7.4% (28.7%) over the baseline four-level
radix-tree-based page table design and by 6.2% (20.1%) over a state-of-the-art
software-managed TLB, across 11 diverse data-intensive workloads. Victima (i)
is effective in both native and virtualized environments, (ii) is completely
transparent to application and system software, and (iii) incurs very small
area and power overheads on a modern high-end CPU.
Comment: To appear in 56th IEEE/ACM International Symposium on
Microarchitecture (MICRO), 202
Utopia: Fast and Efficient Address Translation via Hybrid Restrictive & Flexible Virtual-to-Physical Address Mappings
Conventional virtual memory (VM) frameworks enable a virtual address to
flexibly map to any physical address. This flexibility necessitates large data
structures to store virtual-to-physical mappings, which leads to high address
translation latency and large translation-induced interference in the memory
hierarchy. On the other hand, restricting the address mapping so that a virtual
address can only map to a specific set of physical addresses can significantly
reduce address translation overheads by using compact and efficient translation
structures. However, restricting the address mapping flexibility across the
entire main memory severely limits data sharing across different processes and
increases data accesses to the swap space of the storage device, even in the
presence of free memory. We propose Utopia, a new hybrid virtual-to-physical
address mapping scheme that allows both flexible and restrictive hash-based
address mapping schemes to harmoniously co-exist in the system. The key idea of
Utopia is to manage physical memory using two types of physical memory
segments: restrictive and flexible segments. A restrictive segment uses a
restrictive, hash-based address mapping scheme that maps virtual addresses to
only a specific set of physical addresses and enables faster address
translation using compact translation structures. A flexible segment employs
the conventional fully-flexible address mapping scheme. By mapping data to a
restrictive segment, Utopia enables faster address translation with lower
translation-induced interference. Utopia improves performance by 24% in a
single-core system over the baseline system, whereas the best prior
state-of-the-art contiguity-aware translation scheme improves performance by
13%.
Comment: To appear in 56th IEEE/ACM International Symposium on
Microarchitecture (MICRO), 202
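The hybrid mapping in the Utopia abstract above can be pictured with a toy model. In this sketch the restrictive segment is modelled as a set-associative structure (a virtual page may live only in the few frames of the set its hash selects) and the flexible segment as an ordinary dictionary standing in for a conventional page table. The modulo hash, the sizes, the fallback policy, and every name here are assumptions for illustration only.

```python
class HybridMapper:
    """Toy model: restrictive (hash-based) segment plus flexible fallback."""

    def __init__(self, num_sets=4, ways=2):
        self.num_sets, self.ways = num_sets, ways
        # Restrictive segment: per-set tags; frame number = set * ways + way.
        self.restrictive = [[None] * ways for _ in range(num_sets)]
        # Flexible segment: any virtual page may map to any frame.
        self.flexible = {}
        self.next_flexible_frame = num_sets * ways

    def translate(self, vpn):
        s = vpn % self.num_sets  # hypothetical hash function
        # Fast path: compare against the few tags of one set.
        for way, tag in enumerate(self.restrictive[s]):
            if tag == vpn:
                return ("restrictive", s * self.ways + way)
        # Slow path: stand-in for a conventional page table walk.
        if vpn in self.flexible:
            return ("flexible", self.flexible[vpn])
        return None

    def map(self, vpn):
        existing = self.translate(vpn)
        if existing is not None:
            return existing[1]
        s = vpn % self.num_sets
        for way in range(self.ways):
            if self.restrictive[s][way] is None:
                self.restrictive[s][way] = vpn
                return s * self.ways + way
        # Set is full: fall back to the flexible segment instead of swapping.
        frame = self.next_flexible_frame
        self.next_flexible_frame += 1
        self.flexible[vpn] = frame
        return frame
```

The point of the sketch is the asymmetry: a restrictive lookup touches only one small set, while the flexible segment keeps the freedom to place a page anywhere when the set it hashes to is already full.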
Exploring real-world symptom impact and improvement in well-being domains for tardive dyskinesia in VMAT2 inhibitor-treated patients via clinician survey and chart review
Introduction: Two vesicular monoamine transporter 2 (VMAT2) inhibitors are approved in the United States (US) for the treatment of tardive dyskinesia (TD). There is a paucity of information on the impact of VMAT2 inhibitor treatment on patient social and physical well-being. The study objective was to elucidate clinician-reported improvement in symptoms and any noticeable changes in social or physical well-being in patients receiving VMAT2 inhibitors.
Methods: A web-based survey was offered to physicians, nurse practitioners, and physician assistants based in the US who prescribed valbenazine for TD within the past 24 months. Clinicians reported data from the charts of patients who met the inclusion criteria and were allowed to recall missing information.
Results: Respondents included 163 clinicians who reviewed charts of 601 VMAT2-treated patients with TD: 47% had TD symptoms in ≥2 body regions, with the most common being in the head or face and upper extremities. Prior to treatment, 93% of patients showed impairment in ≥1 social domain, and 88% were impaired in ≥1 physical domain. Following treatment, among those with improvement in TD symptoms (n = 540), 80% to 95% showed improvement in social domains, 90% to 95% showed improvement in physical domains, and 73% showed improvement in their primary psychiatric condition.
Discussion: In VMAT2-treated patients with TD symptom improvement, clinicians reported concomitant improvement in psychiatric disorder symptoms and in social and physical well-being. Regular assessment of TD impact on these types of domains should occur simultaneously with movement disorder ratings when evaluating the value of VMAT2 inhibitor therapy.
Search for continuous gravitational wave emission from the Milky Way center in O3 LIGO--Virgo data
We present a directed search for continuous gravitational wave (CW) signals
emitted by spinning neutron stars located in the inner parsecs of the Galactic
Center (GC). Compelling evidence for the presence of a numerous population of
neutron stars has been reported in the literature, turning this region into a
very interesting place to look for CWs. In this search, data from the full O3
LIGO--Virgo run in the detector frequency band have been
used. No significant detection was found and 95% confidence level upper
limits on the signal strain amplitude were computed, over the full search band,
with the deepest limit of about at .
These results are significantly more constraining than those reported in
previous searches. We use these limits to put constraints on the fiducial
neutron star ellipticity and r-mode amplitude. These limits can also be
translated into constraints in the black hole mass -- boson mass plane for a
hypothetical population of boson clouds around spinning black holes located in
the GC.
Comment: 25 pages, 5 figures